** Folder indicating where the files are located **
cd "C:\Users\remij\Desktop\Main Regressions"
set more off

* This do-file creates two data sets that we use for the regressions:
* "jjm_jebo_21.dta"
* "jjm_jebo_21_panel.dta" 

*********************************
*********************************
*** CREATION OF MAIN DATA SET ***
*********************************
*********************************

************************************************************
*** DATA FROM JEDWAB, MEIER ZU SELHAUSEN AND MORADI 2021 ***
************************************************************

* The data set "jjm_joeg_jebo_21" comes from the accompanying paper: "The Economics of Missionary Expansion: Evidence from Africa and Implications for Development" (as of 08-19-2021, the paper is R&R at the Journal of Economic Growth).
* See the replication files for the paper above for details on how the data set "jjm_joeg_jebo_21" was created. 
use jjm_joeg_jebo_21, clear
codebook gridcell
* 2,091 grid cells
codebook year
* 182 years (1751-1932)
sort gridcell year
save jjm_jebo_21, replace

**********************************************************************
*** ADDITIONAL VARIABLES CREATED USING VARIABLES FROM THE DATA SET ***
**********************************************************************

use jjm_jebo_21, clear

** Square, cube and fourth power of longitude and latitude **
gen longitude2 = longitude*longitude
gen longitude3 = longitude*longitude*longitude
gen longitude4 = longitude*longitude*longitude*longitude
foreach X in 2 3 4 {
label var longitude`X' "Longitude to the power `X'"
}
gen latitude2 = latitude*latitude
gen latitude3 = latitude*latitude*latitude
gen latitude4 = latitude*latitude*latitude*latitude
foreach X in 2 3 4 {
label var latitude`X' "Latitude to the power `X'"
}

** Cell urbanization rate in 2000 **
gen totpop_2000 = upop_2000+rpop_2000
label var totpop_2000 "Total population 2000"
* Urban population 2000 is missing for some cells.
* However, it is because these cells have no cities above 1,000.
* By construction, they have an urbanization rate of 0%.
gen urate2000 = upop_2000/totpop_2000
replace urate2000 = 0 if urate2000 == .
replace urate2000 = urate2000 * 100
count if urate != .
label var urate2000 "Urbanization rate 2000"

** Cells with a mission at any point **
gen prot = (cmeth0 >= 1 | cpresb0 >= 1 | coth0 >= 1)
gen cath = (ccath0 >= 1)
gen presb = (cpresb0 >= 1)
gen metho = (cmeth0 >= 1)
foreach X in prot cath presb metho {
bysort gridcell: egen max`X'_yn = max(`X')
}
foreach X in prot cath presb metho {
label var `X' "Dummy if mission type = `X' in the cell in t"
label var max`X'_yn "Dummy if mission type = `X' at one point"
}

save jjm_jebo_21, replace

*************************************************
*** ADDING VARIABLES FROM THE 2000 CENSUS ***
*************************************************

* We now add various variables that were created using raw data from the 2000 population census.
* Note that the data from the 2000 population census are propietary. 
* In the folder "Regressions\Data Compilation\2000 Census", the "README ON COPYRIGHT" file gives details on how access to the data was obtained and how the data can still be obtained.
* In the same folder, the do-file "JEBO_2000_census" then shows how the variables used here were created based on a data set that Jedwab & Moradi 2015 (REStat) created based on the raw data.
use census2000_gridlevel_jebo, clear
sort gridcell
save census2000_gridlevel_jebo, replace 

* We add to the main data set. 
use jjm_jebo_21, clear
sort gridcell 
merge gridcell using census2000_gridlevel_jebo
tab _m
drop _m

* We change the name of some variables or create new variables *

** RELIGIOUS SHARES **
foreach X in cath prot chri chri2 pent otchri prot2 isla trad {
ren rel_`X' cens_`X'_2000
}
* Catholic + protestants * 
ren cens_chri_2000 cens_cathprot_2000
label var cens_cathprot_2000 "Share of Catholics + Protestants in total population (%)"
* Catholic + protestants + other Christiants but not including Pent./Charismatic * 
gen cens_chri_nopent_2000 = cens_cathprot_2000+cens_otchri_2000
label var cens_chri_nopent_2000 "Share of Christians excl. Pent./Charismatic in total population (%)" 
* Christians (incl. Pentecostal & Other) *
ren cens_chri2_2000 cens_chri_2000
desc cens*
* Protestants core *
ren cens_prot_2000 cens_prot_core_2000
label var cens_prot_core_2000 "Share of Protestants in total population (%)"
* Protestants core + other Christians
gen cens_prot_nopent_2000 = cens_prot_core_2000+cens_otchri_2000
label var cens_prot_nopent_2000 "Share of Protestants + other Christians in total pop. (%)"
* Protestants core + other Christians + Pentecostal * 
ren cens_prot2_2000 cens_prot_2000
label var cens_prot_2000 "Share of Protestants (incl. Pentecostal & Other) in total population (%)"
desc cens*, f
foreach X of varlist cens*_* {
replace `X' = `X'/100
}

** OTHER MAIN OUTCOMES **
desc cogn_br lit_all
label var cogn_br "Share of skilled occupations among the workforce"
label var lit_all "Literacy rate (any language) in 15+ population (%)"
gen cd1549 = ceb1549-cs1549
sum ceb1549
sum cs1549
sum cd1549
gen childmort1549 = cd1549/ceb1549*100
sum childmort1549
label var ceb1549 "Number of children ever born (15-49 women)"
label var cs1549 "Number of children having survived (15-49 women)"
label var cd1549 "Number of children having died (15-49 women)"
label var childmort1549 "Child mortality rate (15-49 women)"
gen cd1549_rural = ceb1549_rural-cs1549_rural
gen childmort1549_rural = cd1549_rural/ceb1549_rural*100
sum childmort1549_rural
label var ceb1549_rural "Number of children ever born (15-49 women, rural)"
label var cs1549_rural "Number of children having survived (15-49 women, rural)"
label var cd1549_rural "Number of children having died (15-49 women, rural)"
label var childmort1549_rural "Child mortality rate (15-49 women, rural)"
gen cd1549_m = cebboys1549-csmale1549
gen cd1549_f = cebgirls1549-csfemale1549
gen childmort1549_m = cd1549_m/cebboys1549*100
gen childmort1549_f = cd1549_f/cebgirls1549*100
label var childmort1549_m "Child mortality rate (15-49 women, boys)"
label var childmort1549_f "Child mortality rate (15-49 women, girls)"
label var cd1549_m "Number of children having died (15-49 women, boys)"
label var cd1549_f "Number of children having died (15-49 women, girls)"
save jjm_jebo_21, replace

*****************************************
*** ADDING THE VARIABLES FROM THE DHS ***
*****************************************

* This data shows variables that were obtained from the DHS *
* We first combined various databases.
* Using ArcGIS and using the coordinates of each DHS cluster, we then added the grid cell of each individual. 
* Using this data, we then created the mean religious shares according to the DHS.
* Since the data is propietary, we do not include the raw data here (which can be accessed at https://dhsprogram.com/data/available-datasets.cfm)
* Instead, the do-file "Merge_GhanaDHS.do" in the folder "Regressions\Data Compilation\DHS" shows how the file "dhs_religion_individual" was created from the raw data. 

use "dhs_religion_individual.dta", clear
* We drop the observations that do not correspond to our grid.
drop if gridcell == ""
* We change the year of birth variable. 
replace v010 = 1900+v010 if v010 >= 1 & v010 <= 83
* We then create the age variable. 
gen age = dhsyear-v010
* We then create the variables we need for the religious share variables *
* Muslim/Christian from religion3
gen christian=(religion3=="Christian")
gen muslim=(religion3=="Muslim")
gen traditional=(religion3=="Traditional")
gen noreligion=(religion3=="no religion")
* Protestant/Catholic from religion3
gen protestant=(religion2=="Protestant")
replace protestant=. if religion2==""
gen catholic=(religion2=="Catholic")
replace catholic=. if religion2==""
* Presbyterian/methodist from religion1
gen presbyterian=( religion1=="Presbyterian")
replace presbyterian=. if religion1==""
gen methodist=( religion1=="Methodist")
replace methodist=. if religion1==""
gen pentecost_other=(religion1=="Pentecost\Other Christian")
replace pentecost_other=. if religion1==""
* Pentecostals from religion0
gen pentecost=(religion0=="Pentecost")
replace pentecost=. if religion0==""
* Number of observations per cluster *
gen count = 1
bysort dhsyear dhsclust: egen sumcount = sum(count)
drop count
ren sumcount numobs
sum numobs, d
* We keep the cells with at least 10 observations *
drop if numobs < 10
save dhs_data, replace

* We now obtain the mean religious shares across the various DHS years. 
use dhs_data, clear
collapse (mean) christian muslim traditional noreligion protestant catholic presbyterian methodist pentecost_other pentecost urban [pw=v005/1000000], by (v000 dhsyear grid)
collapse (mean) christian-pentecost urban, by(gridcell)
foreach X of varlist christian-pentecost urban {
ren `X' dhs_`X'_mean
}
drop if gridcell == ""
sort gridcell 
save dhs_relig_sh_mean, replace

* We then add them to the main data set. 
use jjm_jebo_21, clear
sort gridcell 
merge gridcell using dhs_relig_sh_mean
tab _m
drop _m
foreach X in christian muslim traditional noreligion protestant catholic presbyterian methodist pentecost_other pentecost urban {
label var dhs_`X'_mean "Mean cell share of group `X' in the DHS"
}
gen pres_meth_dhs = dhs_presbyterian_mean-dhs_methodist_mean
label var pres_meth_dhs "Presbyterian share - Methodist share in the DHS"
save jjm_jebo_21, replace 

*******************************************
*** ADDITIONAL DISTANCE-BASED VARIABLES ***
*******************************************

* We now create various distance-based variables that we add to the main data set.

** DISTANCE TO MISSION AT ANY POINT **

use jjm_jebo_21, clear
bysort gridcell: egen maxmissions_yn = max(missions_yn)
keep if maxmissions_yn == 1
keep if year == 1932
keep longitude latitude
ren longitude temp_long
ren latitude temp_lat
keep temp_long temp_lat
save dist_temp_points, replace

use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
sort gridcell
count
* 2091
cross using dist_temp_points
count
geodist latitude longitude temp_lat temp_long, gen(dist)
collapse (min) dist, by(gridcell)
ren dist dist2mission
replace dist2mission = 5 if dist2mission <= 5
gen ldist2mission = log(dist2mission)
gen ldist2mission_sq = ldist2mission*ldist2mission
sort gridcell 
save dist2mission, replace

use jjm_jebo_21, clear
sort gridcell 
merge gridcell using dist2mission
tab _m
drop _m
label var dist2mission "Dist. Any Mission"
label var ldist2mission "Log Dist. Any Mission"
label var ldist2mission_sq "Log Dist. Any Mission - Squared"
save jjm_jebo_21, replace

*** DISTANCE TO A MISSION OF EACH TYPE AT ANY POINT ***

foreach X in prot cath presb metho {
use jjm_jebo_21, clear
keep if max`X'_yn == 1
keep if year == 1932
count
keep longitude latitude
ren longitude temp_long
ren latitude temp_lat
keep temp_long temp_lat
save dist_temp_points, replace

use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
sort gridcell
count
* 2091
cross using dist_temp_points
count
geodist latitude longitude temp_lat temp_long, gen(dist)
collapse (min) dist, by(gridcell)
ren dist dist2mission_`X'
replace dist2mission_`X' = 5 if dist2mission_`X' <= 5
gen ldist2mission_`X' = log(dist2mission_`X')
gen ldist2mission_`X'_sq = ldist2mission_`X'*ldist2mission_`X'
sort gridcell 
save dist2mission_`X', replace
}

foreach X in prot cath presb metho {
use jjm_jebo_21, clear
sort gridcell 
merge gridcell using dist2mission_`X'
tab _m
drop _m
label var dist2mission_`X' "Dist. Any Mission: Type = `X'"
label var ldist2mission_`X' "Log Dist. Any Mission: Type = `X'"
label var ldist2mission_`X'_sq "Log Dist. Any Mission - Squared: Type = `X'"
save jjm_jebo_21, replace
}

*** DISTANCE TO THE FIRST MISSION OF EACH DENOMINATION ***

use jjm_jebo_21, clear
foreach X in dist2firstpresb dist2firstmetho dist2firstcatho {
replace `X' = 5 if `X' <= 5
gen l`X' = log(`X')
}
foreach X in catho presb metho {
label var dist2first`X' "Dist. Any Mission: Type = `X'"
label var ldist2first`X' "Log Dist. Any Mission: Type = `X'"
}
save jjm_jebo_21, replace

*** DISTANCE TO NAVRONGO (FIRST CATHOLIC MISSION IN THE NORTH) ***

use jjm_jebo_21, clear
keep if year == 1932
* Navrongo = Y7
keep if gridcell == "Y7"
keep longitude latitude
ren longitude long_temp 
ren latitude lat_temp
save temp_points, replace

use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
cross using temp_points
geodist latitude longitude lat_temp long_temp, gen(dist2navrongo)
collapse (min) dist2navrongo, by(gridcell)
count
* 2091
replace dist2navrongo = 5 if dist2navrongo == 0
gen ldist2navrongo = log(dist2navrongo)
label var dist2navrongo "Distance to Navrongo"
label var ldist2navrongo "Log Distance to Navrongo"
sort gridcell
save dist2navrongo, replace

use jjm_jebo_21, clear
sort gridcell 
merge gridcell using dist2navrongo 
tab _m
drop _m
egen ldist2firstcathonavr = rmin(ldist2firstcatho ldist2navrongo)
label var ldist2firstcathonavr "Log Dist. 1st Catho. Mission or Navrongo"
save jjm_jebo_21, replace

***********************************
*** NIGHT LIGHTS: NOT TOP-CODED ***
***********************************

* The NOOA nightlight data (not top coded) was downloaded from <https://ngdc.noaa.gov/eog/dmsp/download_radcal.html>. Using Zonal Statistics in QGIS, we then calculate summary statitics by 0.1 x 0.1. grid in which the pixels fall. The Ghana grid shapefile can be found in the data folder.
* See the "Readme" file in the folder "Regressions\Data Compilation\Night Lights" for details on how the data set was created.
use nl_ghana_corrtc, clear
sort gridcell
save nl_ghana_corrtc, replace

* Adding night lights (corrected for top coding) *
use jjm_jebo_21, clear
sort gridcell
merge gridcell using nl_ghana_corrtc, update
tab _m
drop _m
foreach X in 1996 1999 2000 2003 2004 2006 2010 2011 {
gen lnltc`X' = log(nltc`X'+1)
label var nltc`X' "Night light intensity (not top coded) in year `X'"
label var lnltc`X' "Log (night light intensity + 1) (not top coded) in year `X'"
}
save jjm_jebo_21, replace

**********************************************************
*** VARIABLES NEEDED FOR THE IDENTIFICATION STRATEGIES ***
**********************************************************

* Distances in <Spheres of Influence.dta> were created using QGIS. The dbase file was then imported into Stata.
* The shapefiles for the Muslim and Protestant spheres of influence, the straight line separating the Protestant spheres as well as the Ghana grid are in "Regressions\Data Compilation\Spheres of Influence Data"

use "Spheres_of_influence_grid 03222020.dta", clear
* We keep some variables. 
keep gridcell dist2segment dist2sphere_strline presby1867 method1867 nearest_segment west_of_* *mus*
sort gridcell
save "spheres_influence_vars.dta", replace

* We also need the following variables below *

** DISTANCE TO THE PROTESTANT SPHERES **

clear
use "spheres_influence_vars.dta", clear
gen protall = (presby1867 == 1 | method1867 == 1)
keep if protall == 1
keep gridcell protall
sort gridcell
save protall, replace

foreach X in protall {
use jjm_jebo_21, clear
keep if year == 1932
sort gridcell
merge gridcell using protall
keep if protall == 1
keep longitude latitude
ren longitude temp_long
ren latitude temp_lat
keep temp_long temp_lat
save dist_temp_points, replace

use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
sort gridcell
cross using dist_temp_points
count
geodist latitude longitude temp_lat temp_long, gen(dist)
collapse (min) dist, by(gridcell)
ren dist dist2`X'
replace dist2`X' = 5 if dist2`X' <= 5
gen ldist2`X' = log(dist2`X')
gen ldist2`X'_sq = ldist2`X'*ldist2`X'
sort gridcell 
save dist2`X', replace
}

** DISTANCE TO A MISSION OF EACH TYPE IN 1880 ***

foreach X in cath prot {
use jjm_jebo_21, clear
keep if year == 1880
drop prot cath maxprot_yn maxcath_yn 
gen prot = (cmeth0 >= 1 | cpresb0 >= 1 | coth0 >= 1)
gen cath = (ccath0 >= 1)
bysort gridcell: egen max`X'_yn = max(`X')
keep if max`X'_yn == 1
keep longitude latitude
ren longitude temp_long
ren latitude temp_lat
keep temp_long temp_lat
save dist_temp_points, replace
count
use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
sort gridcell
cross using dist_temp_points
count
geodist latitude longitude temp_lat temp_long, gen(dist)
collapse (min) dist, by(gridcell)
ren dist dist2mission_`X'1880
replace dist2mission_`X'1880 = 5 if dist2mission_`X'1880 <= 5
gen ldist2mission_`X'1880 = log(dist2mission_`X'1880)
gen ldist2mission_`X'1880_sq = ldist2mission_`X'1880*ldist2mission_`X'1880
sort gridcell 
save dist2mission_`X'1880, replace
}

*** DISTANCE TO LOME IN TOGO ***

* Data set with the coordinates of Lome *
clear
import excel "lomne.xlsx", sheet("Sheet1") firstrow
keep if city == "Lome"
ren longitude temp_long
ren latitude temp_lat
keep temp_long temp_lat
save dist_temp_points, replace
count

* We create the distance variable. 
use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
sort gridcell
cross using dist_temp_points
geodist latitude longitude temp_lat temp_long, gen(dist)
collapse (min) dist, by(gridcell)
ren dist dist2lome
replace dist2lome = 5 if dist2lome <= 5
gen ldist2lome = log(dist2lome)
sort gridcell 
save dist2lome, replace

*** DISTANCE TO JIRAPA ***

use jjm_jebo_21, clear
* Cell of Jirapa
keep if gridcell == "H10" & year == 1932
keep longitude latitude
ren longitude long_temp 
ren latitude lat_temp
save temp_points, replace

use jjm_jebo_21, clear
keep gridcell longitude latitude 
cross using temp_points
geodist latitude longitude lat_temp long_temp, gen(dist2jirapa)
collapse (min) dist2jirapa, by(gridcell)
replace dist2jirapa = 5 if dist2jirapa == 0
gen ldist2jirapa = log(dist2jirapa)
sort gridcell
save dist2jirapa, replace

*** DISTANCE TO A CATHOLIC MISSION IN 1931 ***

use jjm_jebo_21, clear
keep if year == 1931
keep if catho_yn == 1 
keep longitude latitude
ren longitude long_temp 
ren latitude lat_temp
save temp_points, replace

foreach X in 1931 {
use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
cross using temp_points
geodist latitude longitude lat_temp long_temp, gen(dist2cath`X')
collapse (min) dist2cath`X', by(gridcell)
replace dist2cath`X' = 5 if dist2cath`X' == 0
gen ldist2cath`X' = log(dist2cath`X')
sort gridcell
save dist2cath`X', replace
}

*** DISTANCE TO THE NORTHERN BORDER ***

clear
import delimited "northern_border.csv", clear 
keep gridcell 
gen northborder = 1
sort gridcell
save northborder, replace 

use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
sort gridcell 
merge gridcell using northborder
tab _m
drop _m
keep if northborder == 1
keep longitude latitude 
ren longitude long_border 
ren latitude lat_border
save list_for_northborder, replace

use jjm_jebo_21, clear
keep if year == 1932
keep gridcell longitude latitude 
cross using list_for_northborder
geodist latitude longitude lat_border long_border, gen(dist2northborder)
collapse (min) dist2northborder, by(gridcell)
replace dist2northborder = 5 if dist2northborder == 0
gen ldist2northborder = log(dist2northborder)
sort gridcell
save dist2northborder, replace

** CELLS WITH PRESBYTERIAN OR METHODIST MISSIONS BEFORE 1847 (INCL.) **

use jjm_jebo_21, clear
keep if year <= 1847
keep if presb == 1 | metho == 1
bysort gridcell: keep if _n == 1
keep gridcell
gen methopresb1847 = 1
sort gridcell
save methopresb1847, replace

** WE MERGE WITH THE MAIN DATA SET AND CREATE NEW VARIABLES **

use jjm_jebo_21, clear
* We add the main variables. 
sort gridcell
merge gridcell using "spheres_influence_vars.dta"
tab _m
drop _m
* We then create additional variables *
*** Strategy: Distance to the Protestant boundary ***
gen ldist2segment = log(dist2segment)
label var dist2segment "Distance of centroid to the Protestant boundary (km)"
label var ldist2segment "Log distance of centroid to the Protestant boundary (km)"
gen dist2segment_sq = dist2segment*dist2segment
gen dist2segment_cub = dist2segment*dist2segment*dist2segment
gen dist2segment_four = dist2segment*dist2segment*dist2segment*dist2segment
gen ldist2segment_sq = ldist2segment*ldist2segment
gen ldist2segment_cub = ldist2segment*ldist2segment*ldist2segment
gen ldist2segment_four = ldist2segment*ldist2segment*ldist2segment*ldist2segment
foreach X in sq cub four {
label var dist2segment_`X' "dist2segment - `X'"
label var ldist2segment_`X' "ldist2segment - `X'"
}
* Distance to the straight line of the Protestant boundary *
gen ldist2sphere_strline = log(dist2sphere_strline)
label var dist2sphere_strline "Distance of centroid to the straight Protestant boundary (km)"
label var ldist2sphere_strline "Log distance of centroid to the straight Protestant boundary (km)"
* Areas of competition
gen raceTOkumasi = (nearest_segment == "Race to Kumasi")
gen racePOSTkumasi = (nearest_segment == "Straight line beyond Kumasi")
label var raceTOkumasi "Area of competition for the race to Kumasi"
label var racePOSTkumasi "Area of competition after Kumasi"
* Dummies if East of the boundaries
gen east_of_bound = (west_of_bound == 0)
gen east_of_strline = (west_of_strline == 0)
drop west_of*
label var east_of_bound "Cell centroid is east of Protestant boundary"
label var east_of_strline "Cell centroid east of straight line boundary"
* Other variables related to the Protestant spheres.
gen race1867 = (raceTOkumasi == 1 | racePOSTkumasi == 1)
label var race1867 "Area of competition below and after Kumasi"
gen sphere1867 = (presby1867 == 1 | method1867 == 1) 
label var sphere1867 "Dummy if belongs to one of the spheres of influence"
sort gridcell
merge gridcell using methopresb1847
tab _m
drop _m
*** Strategy: Variables related to the buffer strategy ***
* These variables capture the distance to the Muslim areas. 
* We should include the cells whose centroid is within 5 km. 
foreach X in dist2mus_significant dist2mus_noticeable {
replace `X' = 5 if `X' <= 5 
}
egen dist2mus_all = rmin(dist2mus_significant dist2mus_noticeable)
label var dist2mus_all "Distance to the Muslim sphere of influence"
ren dist2mus_all dist2muslimall
gen ldist2muslimall = log(dist2muslimall)
label var ldist2muslimall "Log distance to the Muslim sphere of influence"
* We include the measures capturing the distance to the spheres. 
sort gridcell
merge gridcell using dist2protall
tab _m
drop _m
label var dist2protall "Distance to Protestant spheres"
label var ldist2protall "Log distance to Protestant spheres"
label var ldist2protall_sq "Log distance to Protestant spheres - squared"
* We create the buffer variable.
gen proddistcore = ldist2protall*ldist2muslimall 
label var proddistcore "Log Dist. Prot. Area*Log Dist. Muslim Area"
gen muslimall  = (dist2muslimall <= 5)
label var muslimall "Muslim area"
gen sphere_all = (presby1867 == 1 | method1867 == 1 | muslimall == 1)
label var sphere_all "Protestant sphere or Muslim area"
* We add the control for the distance to Catholic missions in 1880.
sort gridcell
merge gridcell using dist2mission_cath1880
tab _m
drop _m
label var dist2mission_cath1880 "Distance to Catholic mission 1880"
label var ldist2mission_cath1880 "Log distance to Catholic mission 1880"
label var ldist2mission_cath1880_sq "Log distance to Catholic mission 1880 - squared"
* We add the control for the distance to Lome.
sort gridcell
merge gridcell using dist2lome
tab _m
drop _m
label var dist2lome "Distance to Lome"
label var ldist2lome "Log distance to Lome"
*** Strategy: We now add the variables for the Jirapa miracle strategy ***
* We add the variable for the Jirapa strategy
sort gridcell
merge gridcell using dist2jirapa
tab _m
drop _m
label var dist2jirapa "Distance to Jirapa"
label var ldist2jirapa "Log distance to Jirapa"
* We add the control for the distance to Catholic missions in 1931 *
sort gridcell
merge gridcell using dist2cath1931
tab _m
drop _m
label var dist2cath1931 "Distance to a Catholic mission in 1931"
label var ldist2cath1931 "Log distance to a Catholic mission in 1931"
* We add the control for the distance to the Northern border
sort gridcell
merge gridcell using dist2northborder
tab _m
drop _m
label var dist2northborder "Distance to the Northern border in 1931"
label var ldist2northborder "Log distance to the Northern border in 1931"
* We add a control for whether there is a catholic mission in 1931.
foreach X in 1931 {
gen catho_yn_`X'2 = catho_yn if year == `X'
bysort gridcell: egen catho_yn_`X' = max(catho_yn_`X'2)
drop catho_yn_`X'2
}
label var catho_yn_1931 "Dummy if there is a Catholic mission in 1931"
* Other variables *
gen pres_meth_mis = maxpresb_yn-maxmetho_yn
label var pres_meth_mis "(Presb.-Metho.) Mission Pre1932"
gen pres_meth_dist = ldist2mission_presb-ldist2mission_metho
label var pres_meth_dist "(Presb-Metho) Log Dist Mis. Pre32"
label var methopresb1847 "Dummy if in one of the spheres"
save jjm_jebo_21, replace

**************************************
*** NEW URBAN POPULATION 1891-1931 ***
**************************************

* The urban population controls we used so far were the ones used by Jedwab, Meier zu Selhausen and Moradi 2021 who then used the population data from Jedwab and Moradi 2016.
* We now use and modify the original city population data from Jedwab & Moradi 2016 to obtain the total city population of each cell every ten years from 1891 to 1931.
* Remi Jedwab & Alexander Moradi, 2016. "The Permanent Effects of Transportation Revolutions in Poor Countries: Evidence from Africa," The Review of Economics and Statistics, MIT Press, vol. 98(2), pages 268-284, May.
* The database shows the population of all cities above 1000 at one point.
* When available, it also shows their population estimate below 1000. 
clear
import delimited "urban_gh_03182020_v2.csv"
keep gridcell pop1891-pop1931
* We drop the ones that do not belong to a cell.
drop if gridcell == ""
* We replace by 0 when it is missing. 
foreach X in 1891 1901 1911 1921 1931 {
gen pop`X'_1000 = 0
replace pop`X'_1000 = pop`X' if pop`X' >= 1000
}
* We keep the ones above 1000 at one point.
keep if pop1891 >= 1000 | pop1901 >= 1000 | pop1911 >= 1000 | pop1921 >= 1000 | pop1931 >= 1000
* We obtain the sums for each cell. 
collapse (sum) pop*, by(gridcell)
* We create two versions, one where we ignore the pop. below 1000 and one where we keep it.
foreach X in 1891 1901 1911 1921 1931 {
ren pop`X' newcitypop`X'
replace newcitypop`X' = round(newcitypop`X',1)
ren pop`X'_1000 newcitypop`X'_1000
replace newcitypop`X'_1000 = round(newcitypop`X'_1000,1)
}
foreach X in 1891 1901 1911 1921 1931 {
label var newcitypop`X'_1000 "Sum of city pop. for cities > 1000, year `X'"
label var newcitypop`X' "Sum of city pop. for cities > 1000, incl. est. < 1000, year `X'"
}
sort gridcell
save newcitypop_18911931, replace 

** We now create the panel data from this data **
use newcitypop_18911931, clear
foreach X in 1891 1901 1911 1921 1931 {
ren newcitypop`X'_1000 newcitypop1k`X'
}
reshape long newcitypop newcitypop1k, i(gridcell) j(year)
sort gridcell year
save newcitypop, replace

** We merge with the main data set **
use jjm_jebo_21, clear
sort gridcell year
merge gridcell year using newcitypop, update
tab _m
drop if _m == 2
drop _m
label var newcitypop "Sum of city pop. for cities > 1000, incl. est. < 1000"
label var newcitypop1k "Sum of city pop. for cities > 1000"
save jjm_jebo_21, replace

*******************
*** 1931 CENSUS ***
*******************

** We now use variables that we created using data from the 1931 population census **
* See the folder "Regressions\Data Compilation\1931 Census" for details. 
use "census1931_grid 03222020.dta", clear 
gen total1545 = m15_45+f15_45
keep gridcell education-mental n_compounds total1545
gen infirm = lepers+blindness+deaf+mental
collapse (sum) education-mental infirm n_compounds total1545, by(gridcell)
ren n_compounds compound
* We obtain the shares for each cell *
foreach X of varlist education-mental compound infirm {
replace `X' = `X'/total1545
ren `X' `X'_cap1545_1931 
sum `X'_cap1545_1931 , d
}
drop total
drop if gridcell == ""
drop compound_cap1545_1931
* We keep the two outcomes we use in the paper. 
keep gridcell education_cap1545_1931 infirm_cap1545_1931
label var education_cap1545_1931 "Proportion of the number of 'educated' individuals to the pop. of 15-45 year-olds in 1931"
label var infirm_cap1545_1931 "Proportion of the number of 'infirmities' to the pop. of 15-45 year-olds in 1931"
sort gridcell
save census1931_outcomes, replace 

** We add to the main data set **
use jjm_jebo_21, clear
sort gridcell 
merge gridcell using census1931_outcomes
tab _m
drop _m
* We also create the urbanizaton rate for the year 1931 (from the pop. census data).
gen pop_1931 = upop_1931+rpop_1931
gen urate1931 = upop_1931/pop_1931
sum urate1931, d
label var pop_1931 "Total population in 1931"
label var urate1931 "Urbanization rate in 1931"
save jjm_jebo_21, replace
* We verify the variables are properly labeled
desc *, f

********************************************
*** PANEL DATA FOR TABLE 10 AND FIGURE 7 ***
********************************************

use jjm_jebo_21, clear
** We first create a few more variables **
* First year a mission was opened in the cell
gen opened_yn = (missions_yn == 1 & missions_yn[_n-1] == 0)
gen year_opened = year if opened_yn == 1 & year >= 1891 & year <= 1931
bysort gridcell: egen minyearopen = min(year_opened)
drop opened_yn year_opened
* First year a Protestant mission was opened in the cell
foreach X in prot presb {
gen opened_yn = (`X' == 1 & `X'[_n-1] == 0)
gen year_opened = year if opened_yn == 1 & year >= 1891 & year <= 1931
bysort gridcell: egen minyearopen_`X' = min(year_opened)
drop opened_yn year_opened
}
* We only keep some selected years
keep if year == 1881 | year == 1891 | year == 1901 | year == 1911 | year == 1921 | year == 1931
sort gridcell year
keep if year >= 1891 & year <= 1931
sort gridcell year
* We only keep cells with missions at one point. 
gen opened_yn = (missions_yn == 1 & missions_yn[_n-1] == 0) if year >= 1891
tab year opened_yn, m
bysort gridcell: egen maxopened_yn = max(opened_yn)
tab maxopened_yn if year == 1931
keep if maxopened_yn == 1
tab year
* 387 opened at one point in 1891-1931

* We create a few variables related to city population.
replace newcitypop = 0 if newcitypop == . & year != 1881 & year != 1871
replace newcitypop1k = 0 if newcitypop1k == . & year != 1881 & year != 1871
gen lnewcitypop = log(newcitypop+1)
gen lnewcitypop_nomis = log(newcitypop)
gen lnewcitypop1k = log(newcitypop1k+1)
gen lnewcitypop1k_nomis = log(newcitypop1k)
sort gridcell year
bysort gridcell: gen lag1lnewcitypop = lnewcitypop[_n-1]
bysort gridcell: gen lag1lnewcitypop_nomis = lnewcitypop_nomis[_n-1]
bysort gridcell: gen lag1lnewcitypop1k = lnewcitypop1k[_n-1]
bysort gridcell: gen lag1lnewcitypop1k_nomis = lnewcitypop1k_nomis[_n-1]

* We create a few more variables related to the missions. 
foreach X in missions_yn prot cath presb metho {
bysort gridcell: gen lead1`X' = `X'[_n+1]
bysort gridcell: gen lag1`X' = `X'[_n-1]
bysort gridcell: gen lag2`X' = `X'[_n-2]
bysort gridcell: gen lag3`X' = `X'[_n-3]
}
gen lag1presbmetho = lag1presb-lag1metho

* We create a few variables related to the dependent variable.
gen cityyn = (newcitypop >= 1 & newcitypop != .)
gen cityyn1k = (newcitypop1k >= 1 & newcitypop1k != .)
tab year cityyn
bysort gridcell: gen lag1cityyn = cityyn[_n-1]
bysort gridcell: gen lag1cityyn1k = cityyn1k[_n-1]
gen chgpop = lnewcitypop - lag1lnewcitypop
gen chgpop_nomis = lnewcitypop_nomis - lag1lnewcitypop_nomis
gen chgpop_extmarg = cityyn - lag1cityyn
tab year chgpop_extmarg
gen chgpop1k = lnewcitypop1k - lag1lnewcitypop1k
gen chgpop1k_nomis = lnewcitypop1k_nomis - lag1lnewcitypop1k_nomis
gen chgpop1k_extmarg = cityyn1k - lag1cityyn1k
gen lnewcitypop91 = lnewcitypop if year == 1891
bysort gridcell: egen lnewcitypop1891 = max(lnewcitypop91) 
gen lnewcitypop1k91 = lnewcitypop1k if year == 1891
bysort gridcell: egen lnewcitypop1k1891 = max(lnewcitypop1k91) 

* We only keep some variables to reduce the size of the data set.
keep gridcell year longitude latitude district_2000 ethnic district31 *newcitypop* *missions* *prot* *cath* *presb* *meth* chgpop* minyearopen*

* We now create more variables related to the district and ethnic group fixed effects. 
* district-year FE
egen dist31 = group(district31)
gen dist31yr = string(dist31)+string(year)
drop district31 dist31
codebook dist31yr
* ethnic-year FE
egen ethnicnum = group(ethnic)
gen ethnicyr = string(ethnicnum)+string(year)
drop ethnicnum ethnic
codebook ethnicyr
set matsize 800

* We now only keep the variables we need for the regressions and label them. 
keep gridcell year chgpop1k chgpop1k_extmarg chgpop1k_nomis lag1prot lag1presbmetho lag1missions_yn lag2missions_yn lag1lnewcitypop1k dist31yr ethnicyr lnewcitypop1k1891 longitude latitude minyearopen district_2000

label var gridcell "Grid cell"
label var year "Year"
label var chgpop1k "Change in Log Urban Pop."
label var chgpop1k_extmarg "Change in Log Urban Pop. - Extensive Margin"
label var chgpop1k_nomis "Change in Log Urban Pop. - Intensive Margin"
label var lag1prot "Dummy Prot. Mission t-10"
label var lag1presbmetho "Dummy Presb. Mis. t-10 - Dummy Metho. Mis. t-10"
label var lag1missions_yn "Dummy Mission t-10"
label var lag2missions_yn "Dummy Mission t-20"
label var lag1lnewcitypop1k "Log of the initial urban pop. level in t-10"
label var lnewcitypop1k1891 "Log of the urban pop. in 1897"
label var longitude "Longitude"
label var latitude "Latitude"
label var minyearopen "First year of the opening of a mission"
label var dist31yr "District (1931)-year FE"
label var ethnicyr "Ethnic Group-year FE"
label var district_2000 "District (2000)"
save jjm_jebo_21_panel, replace

* We verify the variables are properly labeled
desc *, f
